SOLR-18060: Add Prometheus metrics to CrossDC Consumer. #4063

Merged
sigram merged 16 commits into apache:main from sigram:jira/solr-18060-2
Feb 4, 2026

@sigram sigram commented Jan 19, 2026

This PR replaces Dropwizard JSON metrics with Prometheus metrics in the CrossDC Consumer, directly using the Prometheus client_java API. It also removes the Dropwizard dependency.

@mlbiscoc mlbiscoc left a comment

Can you post a sample of all these metrics? Either dump it here or in a txt file? It would be easier to review the names and labels on the metrics.

Counter.builder()
.name("consumer_input_total")
.help("Total number of input messages")
.labelNames("type", "subtype")
Contributor

I question whether most of these metrics really need a type label. What is its cardinality, and what are the possible combinations? I see that UPDATE is one value in the test. Is there also QUERY or something along those lines?

Contributor Author

Yes, there's ADMIN and CONFIGSET.

Contributor

Hmm, OK. I am not a fan of this label being called type. I think it should carry some context about what it means, since type and subtype can be very generic. Is it an operation, or maybe message_type? And then what can subtype be? In core I made it category, but it is debatable whether we should just remove that label/attribute altogether from the metrics. If you rename type to something more specific, then maybe you can rename subtype to type. Again, seeing a sample text output of these metrics would help, if you can.

Contributor Author

It's a request type - there are currently three types: UPDATE, ADMIN and CONFIGSET. Sub-type is primarily for UPDATE (add, dbi, dbq) and ADMIN (path).

Here's a sample output:

# HELP crossdc_consumer_input_total Total number of input messages
# TYPE crossdc_consumer_input_total counter
crossdc_consumer_input_total{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE"} 1.0
# HELP crossdc_consumer_output_total Total number of output requests
# TYPE crossdc_consumer_output_total counter
crossdc_consumer_output_total{otel_scope_name="org.apache.solr",result="handled",type="UPDATE"} 1.0
# HELP crossdc_consumer_output_batch_size Histogram of output batch sizes
# TYPE crossdc_consumer_output_batch_size histogram
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="0.0"} 0
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="5.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="10.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="25.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="50.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="75.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="100.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="250.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="500.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="750.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="1000.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="2500.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="5000.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="7500.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="10000.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="+Inf"} 1
crossdc_consumer_output_batch_size_count{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE"} 1
crossdc_consumer_output_batch_size_sum{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE"} 1.0
# HELP crossdc_consumer_output_first_attempt_time_nanoseconds Histogram of first attempt request times
# TYPE crossdc_consumer_output_first_attempt_time_nanoseconds histogram
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="0.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="10000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="25000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="50000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="100000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="250000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="500000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1000000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="2500000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5000000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="2.5E7"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1.0E8"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1.0E9"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="+Inf"} 1
crossdc_consumer_output_first_attempt_time_nanoseconds_count{otel_scope_name="org.apache.solr",type="UPDATE"} 1
crossdc_consumer_output_first_attempt_time_nanoseconds_sum{otel_scope_name="org.apache.solr",type="UPDATE"} 1.7667254470782164E18
# HELP crossdc_consumer_output_time_milliseconds Histogram of output request times
# TYPE crossdc_consumer_output_time_milliseconds histogram
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="0.0"} 0
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5.0"} 0
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="10.0"} 0
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="25.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="50.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="75.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="100.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="250.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="500.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="750.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1000.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="2500.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5000.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="7500.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="10000.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="+Inf"} 1
crossdc_consumer_output_time_milliseconds_count{otel_scope_name="org.apache.solr",type="UPDATE"} 1
crossdc_consumer_output_time_milliseconds_sum{otel_scope_name="org.apache.solr",type="UPDATE"} 13.0
# TYPE target_info gauge
target_info{service_name="unknown_service:java",telemetry_sdk_language="java",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="1.56.0"} 1
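
A note for reading the histogram lines above: the `_bucket` series are cumulative, so each `le` bucket counts every observation at or below its bound. The following is a minimal Java sketch of that counting, using the bounds visible in the sample and a single 13.0 observation (which is what the count and sum of `crossdc_consumer_output_time_milliseconds` suggest). The class and method names are hypothetical and not part of the PR.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the PR: reproduces cumulative "le" bucket counting. */
final class CumulativeBucketsSketch {

  // Upper bounds copied from the sample output; the final +Inf bucket is implicit.
  static final double[] BOUNDS = {
    0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000
  };

  /** Returns one cumulative count per bound, plus a last entry for le="+Inf". */
  static long[] cumulativeCounts(double[] bounds, double[] observations) {
    long[] counts = new long[bounds.length + 1];
    for (double value : observations) {
      for (int i = 0; i < bounds.length; i++) {
        if (value <= bounds[i]) {
          counts[i]++; // every bucket whose bound is >= value includes it
        }
      }
      counts[bounds.length]++; // the +Inf bucket includes every observation
    }
    return counts;
  }

  public static void main(String[] args) {
    long[] counts = cumulativeCounts(BOUNDS, new double[] {13.0});
    System.out.println(Arrays.toString(counts));
  }
}
```

With one 13.0 observation, the `le="10.0"` bucket stays at 0 and every bucket from `le="25.0"` upward reads 1, which matches the sample.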

public static final String ATTR_SUBTYPE = "subtype";
public static final String ATTR_RESULT = "result";

protected final Map<String, Attributes> attributesCache = new ConcurrentHashMap<>();
Contributor

What is this for?

Contributor Author

To avoid repeatedly creating millions of small objects (Attributes) when updating metrics.
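
The caching idea can be sketched with plain JDK types. This is only an illustration under stated assumptions: `LabelSet` stands in for OpenTelemetry's `Attributes`, and the key format and method names are hypothetical rather than what the PR actually does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch, not the PR's code: reuse one immutable label set per type/subtype pair. */
final class AttributesCacheSketch {

  /** Stand-in for an immutable attribute set such as OpenTelemetry's Attributes. */
  static final class LabelSet {
    final String type;
    final String subtype;

    LabelSet(String type, String subtype) {
      this.type = type;
      this.subtype = subtype;
    }
  }

  private final Map<String, LabelSet> attributesCache = new ConcurrentHashMap<>();

  /** Returns the cached instance for this pair, creating it only on first use. */
  LabelSet attributesFor(String type, String subtype) {
    // Assumed key format; '/' is presumed not to occur in type names.
    String key = type + "/" + subtype;
    return attributesCache.computeIfAbsent(key, k -> new LabelSet(type, subtype));
  }

  /** Number of distinct label sets created so far. */
  int size() {
    return attributesCache.size();
  }

  public static void main(String[] args) {
    AttributesCacheSketch cache = new AttributesCacheSketch();
    LabelSet first = cache.attributesFor("UPDATE", "add");
    LabelSet second = cache.attributesFor("UPDATE", "add");
    System.out.println(first == second); // same instance is reused
  }
}
```

In this sketch the key string is still built on every call, so the saving is limited to the label-set objects themselves. The cache also grows with the number of distinct type/subtype pairs, which ties back to the cardinality question raised earlier in the review.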

@sigram sigram requested a review from mlbiscoc January 29, 2026 13:35

@mlbiscoc mlbiscoc left a comment

Just did another run through. Liking the changes. Just a few more comments.

log.trace("result=nothandled_shutdown");
}
- metrics.counter(MetricRegistry.name(type.name(), "nothandled_shutdown")).inc();
+ metrics.incrementOutputCounter(type.name(), "nothandled_shutdown");
Contributor

Maybe unhandled_shutdown instead of nothandled_shutdown?

Comment on lines +45 to +52
protected LongCounter inputMsg;
protected LongCounter inputReq;
protected LongCounter collapsed;
protected LongCounter output;
protected LongHistogram outputBatchSizeHistogram;
protected LongHistogram outputTimeHistogram;
protected LongHistogram outputBackoffHistogram;
protected LongHistogram outputFirstAttemptHistogram;
Contributor

I had created these Attributed instrument wrappers so you could bind attributes to instruments, which may have worked here and is how all of Solr core does it. I probably should have mentioned that earlier; my fault. Honestly, it's not a blocker, and I'm fine with this direction.


if (status != 0) {
- metrics.counter(MetricRegistry.name(type.name(), "outputErrors")).inc();
+ metrics.incrementOutputCounter(type.name(), "solrError");
Contributor

solr_error, or maybe just error. It should already be clear that it's in the context of Solr.
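
One way to keep `result` label values consistent with this snake_case suggestion is to normalize them in a single place. A small sketch follows; `toSnakeCase` is a hypothetical helper and not something the PR contains.

```java
/** Hypothetical helper, not part of the PR: converts camelCase label values to snake_case. */
final class LabelValueSketch {

  // Intended for camelCase input only; an all-caps value would be split letter by letter.
  static String toSnakeCase(String value) {
    StringBuilder out = new StringBuilder(value.length() + 4);
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      if (Character.isUpperCase(c)) {
        // Insert a separator before an upper-case letter unless one is already there.
        if (i > 0 && out.charAt(out.length() - 1) != '_') {
          out.append('_');
        }
        out.append(Character.toLowerCase(c));
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(toSnakeCase("solrError")); // prints "solr_error"
  }
}
```

Values that are already lower-case, such as handled or nothandled_shutdown, pass through unchanged.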

Comment on lines 36 to 40
// implementation libs.prometheus.metrics.model
// implementation(libs.prometheus.metrics.expositionformats, {
// exclude group: "io.prometheus", module: "prometheus-metrics-shaded-protobuf"
// exclude group: "io.prometheus", module: "prometheus-metrics-config"
// })
Contributor

Remove these comments

Comment on lines 501 to 502
prometheus-metrics-core = { module = "io.prometheus:prometheus-metrics-core", version.ref = "prometheus-metrics" }
prometheus-metrics-exporter-servlet-jakarta = { module = "io.prometheus:prometheus-metrics-exporter-servlet-jakarta", version.ref = "prometheus-metrics" }
Contributor

Do we still need this? Otherwise just remove it.

@sigram sigram merged commit f1c4d25 into apache:main Feb 4, 2026
5 of 9 checks passed
sigram added a commit to sigram/solr that referenced this pull request Feb 4, 2026
2 participants